Introduction

Down sampling is typically done when the input tables are large (e.g., each containing more than 100K tuples). This IPython notebook illustrates how to down sample two large tables that are already loaded in memory.

First, we need to import the py_entitymatching package as follows:


In [2]:
import py_entitymatching as em

Next, we need to read in the input tables. For the purposes of this notebook, we will use two large datasets: Citeseer and DBLP. You can download the Citeseer dataset from http://pages.cs.wisc.edu/~anhai/data/falcon_data/citations/citeseer.csv and the DBLP dataset from http://pages.cs.wisc.edu/~anhai/data/falcon_data/citations/dblp.csv. Once downloaded, save them as 'citeseer.csv' and 'dblp.csv' in the current directory.


In [5]:
# Read the CSV files; setting low_memory to False speeds up loading
A = em.read_csv_metadata('./citeseer.csv', low_memory=False)
B = em.read_csv_metadata('./dblp.csv', low_memory=False)

In [6]:
len(A), len(B)


Out[6]:
(1823978, 2512927)
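These sizes show why down sampling matters: comparing every tuple in A with every tuple in B would mean examining the full Cartesian product of the two tables, which is far too many pairs to process directly. A quick back-of-the-envelope check:

```python
# Number of tuple pairs if we compared every tuple in A with every tuple in B
n_pairs = 1823978 * 2512927
print(f"{n_pairs:,} candidate pairs")  # over 4.5 trillion pairs
```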

In [7]:
A.head()


Out[7]:
id title authors journal month year publication_type
0 1 An Arithmetic Analogue of Bezouts Theorem David Mckinnon NaN NaN NaN NaN
1 2 Thompsons Group F is Not Minimally Almost Convex James Belk, Kai-uwe Bux NaN NaN 2002.0 NaN
2 3 Cognitive Dimensions Tradeoffs in Tangible User Interface Design Darren Edge, Alan Blackwell NaN NaN NaN NaN
3 4 ACTIVITY NOUNS, UNACCUSATIVITY, AND ARGUMENT MARKING IN YUKATEKAN SSILA meeting; Special Session... J. Bohnemeyer, Max Planck, I. Introduction NaN NaN 2002.0 NaN
4 5 PS1-6 A6 ULTRASOUND-GUIDED HIFU NEUROLYSIS OF PERIPHERAL NERVES TO TREAT SPASTICITY AND J. L. Foley, J. W. Little, F. L. Starr Iii, C. Frantz NaN NaN NaN NaN

In [8]:
B.head()


Out[8]:
id title authors journal month year publication_type
0 1 Klaus Tschira Stiftung gemeinntzige GmbH, KTS Klaus Tschira NaN NaN 2012 www
1 2 The SGML/XML Web Page Robin Cover NaN NaN 2006 www
2 3 The Future of Classic Data Administration: Objects + Databases + CASE Arnon Rosenthal NaN NaN 1998 www
3 4 XML Query Data Model Mary F. Fernandez, Jonathan Robie NaN NaN 2001 www
4 5 The XML Query Algebra Peter Fankhauser, Mary F. Fernndez, Ashok Malhotra, Michael Rys, Jrme Simon, Philip Wadler NaN NaN 2001 www

In [9]:
# Set 'id' as the key of each input table
em.set_key(A, 'id')
em.set_key(B, 'id')


Out[9]:
True
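set_key returns True when the column is a valid key. Conceptually, a key column must contain no missing values and no duplicates; the sketch below illustrates that check with a hypothetical helper (it is not py_entitymatching's internal code):

```python
# Hedged sketch of a key-validity check (hypothetical helper, not em's code):
# a key column must have no missing values and no duplicate values.
def is_valid_key(values):
    seen = set()
    for v in values:
        if v is None or v in seen:  # missing value or duplicate -> not a key
            return False
        seen.add(v)
    return True

is_valid_key([1, 2, 3])     # returns True
is_valid_key([1, 2, 2])     # returns False (duplicate)
is_valid_key([1, None, 3])  # returns False (missing value)
```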

In [10]:
# Display the keys
em.get_key(A), em.get_key(B)


Out[10]:
('id', 'id')

Downsample the Input Tables

Now, the input tables can be down sampled as shown below:


In [12]:
# Downsample the datasets 
sample_A, sample_B = em.down_sample(A, B, size=1000, y_param=1, show_progress=False)

In the down_sample command, size is the number of tuples to sample from B (this becomes the size of the sampled B table), and y_param is the number of tuples to pick from A for each tuple in the sampled B table. Above, we sample 1000 tuples from B and pick 1 tuple from A per sampled B tuple.
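To build intuition for what such a sampler does, here is a toy sketch of token-based down sampling over two small tables represented as id-to-title dicts. This is a simplified illustration, not py_entitymatching's actual implementation (the function name and tie-breaking rule are made up for this example): sample `size` rows from B, then for each sampled B row pick up to `y_param` rows of A that share at least one title token.

```python
import random

def toy_down_sample(A, B, size, y_param, seed=0):
    """Toy illustration only: A and B map row id -> title string."""
    rng = random.Random(seed)
    # Build an inverted index: token -> set of A row ids containing it
    index = {}
    for rid, title in A.items():
        for tok in title.lower().split():
            index.setdefault(tok, set()).add(rid)
    # Randomly sample `size` rows from B
    sampled_b = dict(rng.sample(sorted(B.items()), min(size, len(B))))
    # For each sampled B row, pick up to y_param A rows sharing a token
    picked_a = set()
    for title in sampled_b.values():
        candidates = set()
        for tok in title.lower().split():
            candidates |= index.get(tok, set())
        picked_a |= set(sorted(candidates)[:y_param])  # arbitrary tie-break
    return {rid: A[rid] for rid in picked_a}, sampled_b

A = {1: "XML query algebra", 2: "Graph databases", 3: "Query optimization"}
B = {10: "The XML query data model", 11: "Deep learning"}
sample_a, sample_b = toy_down_sample(A, B, size=2, y_param=1)
```

Note the resulting invariants, which also hold for the real command: the sampled B table has (at most) `size` rows, and the sampled A table has at most `size * y_param` rows.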


In [13]:
# Display the number of tuples in the sampled datasets
len(sample_A), len(sample_B)

Now, the input tables A and B have been down sampled to the smaller tables sample_A and sample_B.


In [ ]:
# Show the metadata of sample_A, sample_B
em.show_properties(sample_A)

In [ ]:
em.show_properties(sample_B)

Note that the sampled tables retain the same properties (metadata) as the input tables.